DPLASMA Warmup -- 2nd try #69

Merged: 6 commits into ICLDisco:master on Jun 27, 2023

Conversation

therault (Contributor)

This is a second try at solving the warmup issue in DPLASMA (especially in CUDA codes).

Here are some performance measurements of the approach proposed in this PR, on Leconte (8x V100):

[Figure: dpotrf on Leconte, warmup vs. no-warmup (averages)]

[Figure: dpotrf on Leconte, warmup vs. no-warmup (per-run details)]

'gflops/avg' represents the ratio of a run's gflops to the 'appropriate average': for runs without warmup, the average excluding the outlier; for runs with warmup, the average of all measured points.
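
As a concrete reading of that metric, here is a minimal sketch (not code from this PR) of how the 'appropriate average' could be computed, assuming the outlier excluded on no-warmup runs is the first, cold run:

```c
#include <stddef.h>

/* Sketch only: compute the 'appropriate average' described above.
 * Assumption (not stated in the PR): without warmup, the excluded
 * outlier is the first, cold run. */
static double appropriate_average(const double *gflops, size_t n, int warmed_up)
{
    size_t start = warmed_up ? 0 : 1;  /* skip the cold first run if no warmup */
    double sum = 0.0;
    if (n <= start) return 0.0;        /* not enough measurements */
    for (size_t i = start; i < n; i++)
        sum += gflops[i];
    return sum / (double)(n - start);
}

/* 'gflops/avg' for run i is then gflops[i] / appropriate_average(...). */
```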

There is still an unidentified warmup problem at large tile sizes (512 and 1024): for 1 to 4 GPUs, the first actual run is still slower than the others at small problem sizes. The source of this issue is unclear at this point, but the warmup patch fixes most of the CUDA/cuBLAS warmup issues.

The goal of the current code is to include changes for all tests that feature both a CUDA implementation and timing:

  • POTRF
  • GEMM (WIP)
  • POINV
  • GEQRF

TRSM is the last kernel that features a CUDA implementation, and its testing does not include timing.

During the discussion, Aurelien pointed out an issue with HIP: memory allocation on the HIP device was lazy at some point, and allocation at first touch is a significant part of the warmup overhead of the HIP runs. We decided that this should be solved at the PaRSEC level, during memory allocation, and not at the DPLASMA warmup level.
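
For illustration only (per the discussion, the real fix belongs in PaRSEC's allocation path, not in this PR), here is a minimal HIP sketch of eager first-touch: allocate, then write the whole buffer once so the lazy-allocation cost is paid before any timed run:

```c
#include <hip/hip_runtime.h>
#include <stddef.h>

/* Sketch only: make a HIP device allocation "eager" by touching every
 * byte once at allocation time, so first-touch overhead is not charged
 * to the first timed run. */
static void *alloc_and_touch(size_t bytes)
{
    void *dev_ptr = NULL;
    if (hipMalloc(&dev_ptr, bytes) != hipSuccess)
        return NULL;
    /* Writing the whole buffer forces the pages to be backed now. */
    if (hipMemset(dev_ptr, 0, bytes) != hipSuccess) {
        (void)hipFree(dev_ptr);
        return NULL;
    }
    (void)hipDeviceSynchronize();  /* make sure the touch has completed */
    return dev_ptr;
}
```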

therault (Contributor, Author)

Done: all PTG tests that are CUDA-enabled

To do:

  • DTD tests that are CUDA-enabled
  • Update performance of CUDA-enabled tests on recent machines

bosilca (Contributor)

bosilca commented Mar 31, 2023

As discussed on 03/31/23, we need to:

  • cover the CPU-only tests (in addition to the device tests)

  • touch the entire data at least once on the GPU (at the level of PaRSEC data, during the allocation stage)

therault (Contributor, Author)

therault commented Apr 6, 2023

  • check what is happening in paranoid debug mode: tests on Frontier show that the local data distributions generated via the warmup calls trigger (wrongful?) asserts
  • GPU load statistics should be reset after the completion of each warmup test. Shouldn't that happen automatically via the completion of tasks? parsec#01592dc6 adds a function to do that (explicit call; see the sketch below)
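
If that referenced commit is the one used here, the testers would make an explicit reset call between the warmup and the timed runs. A sketch follows; the function name parsec_devices_reset_load is assumed from the referenced parsec commit, not verified against it:

```c
#include <parsec.h>

/* Sketch only: between warmup and timed runs, explicitly reset the
 * per-device load statistics so the scheduler's load balancing starts
 * fresh. parsec_devices_reset_load() is an assumed name taken from the
 * referenced parsec commit, not verified here. */
static void reset_after_warmup(parsec_context_t *parsec)
{
    parsec_devices_reset_load(parsec);
}
```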

therault marked this pull request as ready for review June 2, 2023 14:44
therault requested a review from a team as a code owner June 2, 2023 14:44
… in testers, to provide a simple way to get consistent performance results.

Implementation of warmup_zpotrf in testing_zpotrf.c

testing_zpotrf: use zplghe and not zplrnt to initialize symmetric positive definite matrices; call the warmup function as it has been validated experimentally
Port warmup to testing_zpoinv
Port warmup to QR (PTG). Looks like CUDA-QR is having some issues.
Support warmup for GEMM -- Only assign a preferred device in zgemm_NN_gpu.jdf if the upper level has not assigned one, allowing the user fine control over where tasks execute (and the warmup process definitely wants that control)

Update to current parsec, and enable warmup in testing_zgebrd_ge2g

Port warmup to testing_zgelqf_hqr

Port testing_zgelqf_systolic

Fix some bugs in testing_zpotrf.c's warmup

Add warmup for zgetrf*, zpoinv, and zpotrf_dtd*

Use the same zgeqrf warmup for dtd tests

Use the same warmup for testing_zgemm and testing_zgemm_dtd

Porting warmup on zgelqf

Add loop and warmup to testing_zheev

Add warmup and performance measurement loop to GEQRF HQR and Systolic

Implement new warmup strategy when no known GPU implementation exists

 - if there is a known GPU implementation, just assume we need to warm up
   once per device
 - if there is no known GPU implementation, iterate over the task classes,
   and check if a GPU implementation exists. If that is the case, run a warmup
   for each device of that type. That codepath will be skipped until someone
   implements a GPU version for all operations... Worst case, it will not
   work properly but will not break the test. Best case, we will not forget
   to do warmup for GPU cases.
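
A minimal sketch of that task-class scan follows; the field and constant names (nb_task_classes, task_classes_array, incarnations, PARSEC_DEV_NONE, PARSEC_DEV_CUDA, PARSEC_DEV_HIP) reflect PaRSEC internals as assumed here, not code taken from this PR:

```c
#include <stdbool.h>
#include <parsec.h>

/* Sketch only: decide whether a taskpool needs a per-device warmup by
 * checking each task class for a GPU incarnation. */
static bool taskpool_has_gpu_incarnation(const parsec_taskpool_t *tp)
{
    for (unsigned i = 0; i < tp->nb_task_classes; i++) {
        const parsec_task_class_t *tc = tp->task_classes_array[i];
        /* The incarnation list is terminated by PARSEC_DEV_NONE. */
        for (unsigned j = 0; tc->incarnations[j].type != PARSEC_DEV_NONE; j++) {
            if (tc->incarnations[j].type & (PARSEC_DEV_CUDA | PARSEC_DEV_HIP))
                return true;  /* this task class has a GPU body */
        }
    }
    return false;  /* no GPU body anywhere: skip the warmup codepath */
}
```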

Add the warmup/loop to ZGESVD

Fix GEMM warmups: GEMM uses reshaping to support ScaLAPACK + TILED
data representations, and the data collection wrapper does not
work well with the hack of changing the rank_of function in the
source data collections. Simply do a 1D distribution of A and C
over all the ranks to ensure that all processes initialize GEMM
in the warmup.
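
To illustrate that workaround, a hypothetical 1D cyclic rank_of for the warmup distribution (all names invented for this sketch): tile coordinates are flattened and dealt out cyclically, so every rank owns some tiles of A and C and therefore takes part in the GEMM warmup:

```c
#include <stdint.h>

/* Sketch only: 1D cyclic mapping from tile (m, n) to an owning rank.
 * 'nt' is the number of tile columns, 'nb_ranks' the number of processes;
 * as soon as there are at least nb_ranks tiles, every process owns data. */
static uint32_t warmup_rank_of_1d(int m, int n, int nt, int nb_ranks)
{
    return (uint32_t)(((int64_t)m * (int64_t)nt + n) % nb_ranks);
}
```
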
abouteiller (Contributor)

  • GPU load statistics should be reset after the completion of each warmup test. Shouldn't that happen automatically via the completion of tasks? parsec#01592dc6 adds a function to do that (explicit call)

This is done in #89

src/zgeqrf.jdf (review thread, resolved)
abouteiller merged commit a0ea91c into ICLDisco:master on Jun 27, 2023